Building High Quality Databases for Minority Languages such as Galician

Authors

  • Francisco Campillo Díaz
  • Daniela Braga
  • Ana Belén Mourín
  • Carmen García-Mateo
  • Pedro Silva
  • José Miguel Salles Dias
  • Francisco Méndez Pazó
Abstract

This paper describes the result of a joint R&D project between Microsoft Portugal and the Signal Theory Group of the University of Vigo (Spain), in which a set of language resources was developed for Text-to-Speech synthesis. First, a large corpus of 10,000 Galician sentences was designed and recorded by a professional female speaker. Second, a lexicon of over 90,000 entries with phonetic and grammatical information was collected and manually reviewed by an expert linguist. Finally, these resources were used in a MOS (Mean Opinion Score) perceptual test to compare two state-of-the-art speech synthesizers, one from each group: the Microsoft system, based on HMMs, and the University of Vigo system, based on unit selection.

Similar resources

Acoustic Modeling and Training of a Bilingual ASR System when a Minority Language is Involved

This paper describes our work in developing a bilingual speech recognition system using two SpeechDat databases. The bilingual aspect of this work is of particular importance in the Galician region of Spain, where Galician and Spanish coexist and one of them, Galician, is a minority language. Based on a global Spanish-Galician phoneme set we built a bilingual spee...

Tecnologías del habla y lenguas minoritarias

In this paper we present our latest developments in speech and language technology for two languages: Spanish and Galician. Special attention is devoted to the situation of the minority language, Galician, where the lack of resources endangers its inclusion in speech products.

Proactive Learning for Building Machine Translation Systems for Minority Languages

Building machine translation (MT) for many minority languages in the world is a serious challenge. For many minority languages there is little machine-readable text, few knowledgeable linguists, and little money available for MT development. For these reasons, it becomes very important for an MT system to make the best use of its resources, both labeled and unlabeled, in building a quality system. In ...

CORILGA: a Galician Multilevel Annotated Speech Corpus for Linguistic Analysis

This paper describes the CORILGA (“Corpus Oral Informatizado da Lingua Galega”). CORILGA is a large high-quality corpus of spoken Galician from the 1960s to the present day, including both formal and informal spoken language from both standard and non-standard varieties, and across different generations and social levels. The corpus will be available to the research community upon completion. Ga...

Algoritmo de stemming para el gallego

The quantity and quality of the resources and tools available for natural language processing depend on the language in question. In the Iberian Peninsula, Galician is one of the languages that lack such tools and resources. To contribute to their development, this paper presents a stemmer specifically designed for the Galician language. It was first introduced in 2002, but since then it...

Publication date: 2010